Microarchitecture

Introduction

This chapter covers microarchitecture, which is the connection between logic and architecture. Microarchitecture is the specific arrangement of registers, ALUs, finite state machines, memories and other logic building blocks needed to implement an architecture. A particular architecture, such as MIPS, may have many different microarchitectures, each with different trade-offs of performance, cost, and complexity. They all run the same programs, but their internal designs vary widely. We will design three different microarchitectures in this chapter to illustrate the trade-offs.

Architectural State and Instruction Set

Recall that a computer architecture is defined by its instruction set and architectural state. The architectural state for the MIPS processor consists of the program counter and the 32 registers. Any MIPS microarchitecture must contain all of this state. based on the current architectural state, the processor executes a particular instruction with a particular set of data to produce a new architectural state. Some microarchitectures contain additional nonarchitectural state to either simplify the logic or improve performance.

To keep the microarchitectures easy to understand, we consider only a subset of the MIPS instruction set. Specifically, we handle the following instructions:

R-type arithmetic/logic instructions: add, sub, and, or, slt
Memory instructions: lw, sw
Branches: beq

After building the microarchitectures with these instructions, we extend them to handle addi and j. These particular instructions were chosen because they are sufficient to write many interesting programs. One one understands how to implement these instructions, one can expand the hardware to handle others.

Design Process

We will divide our microarchitectures into two interacting parts: the datapath and the control. The datapath operates on words of data. It contains structures such as memories, registers, ALUs, and multiplexers. MIPS is a 32-bit architecture, so we will use a 32-bit datapath. The control unit receives the current instruction from the datapath and tells the datapath how to execute that instruction. Specifically, the control unit produces multiplexer select, register enable, and memory write signals to control the operation of the datapath.

A good way to design a complex system is to start with hardware containing the state elements. These elements include the memories and the architectural state (program counter and registers). Then, add blocks of combinational logic between the state elements to compute the new state based on the current state. The instruction is read from part of memory; load and store instructions then read or write data from another part of memory. Hence, it is often convenient to partition the overall memory into two smaller memories, one containing instructions and the other containing data. The figure shows a block diagram with the four state elements: the program counter, register file, and instruction and data memories.

State elements usually have a reset input to put them into a known state at start-up. This reset is usually not shown.

The program counter is an ordinary 32-bit register. Its output, $PC$ , points to the current instruction. Its input, $PC^{'}$ , indicates the address of the next instruction.

The instruction memory has a single read port. It takes a 32-bit instruction address input, $A$ , and reads the 32-bit data (i.e., instruction) from that address onto the read data output, $RD$ .

The 32-element $\times$ 32-bit register file has two read ports and one write port. The read ports take 5-bit address inputs, $A 1$ and $A 2$ , each specifying one of $2^{5} = 32$ registers as source operands. They read the 32-bit register values onto read data outputs $RD 1$ and $RD 2$ respectively. The write port takes a 5-bit address input, $A 3$ ; a 32-bit write data input, $WD$ ; a write enable input, $WE 3$ ; and a clock. If the write enable is 1, the register file writes the data into the specified register on the rising edge of the clock.

The data memory has a single read/write port. If the write enable, $WE$ , is 1, it writes data $WD$ into address $A$ on the rising edge of the clock. If the write enable is 0, it reads address $A$ onto $RD$ .

The instruction memory, register file, and data memory are all read combinationally. In other words, if the address changes, the new data appears at $RD$ after some propagation delay; no clock is involved. They are written only on the rising edge of the clock. In this fashion, the state of the system is changed only at the clock edge. The address, data, and write enable must be setup sometime before the clock edge and must remain stable until a hold time after the clock edge.Because the state elements change their state only on the rising edge of the clock, they are synchronous sequential circuits. The microprocessor is built of clocked state elements and combinational logic, so it too is a synchronous sequential circuit. Indeed, the processor can be viewed as a giant finite state machine, or as a collection of simpler interacting state machines.

Untitled

MIPS Microarchitectures

In this chapter, we develop three microarchitectures for the MIPS processor architecture: single-cycle, multi-cycle, and pipelined. They differ in the way that the state elements are connected together and in the amount of nonarchitectural state.

The single-cycle microarchitecture executes an entire instruction in one cycle. it is easy to explain and has a simple control unit. Because it completes the operation in one cycle, it does not require any nonarchitectural state. However, the cycle time is limited by the slowest instruction.

The multicycle microarchitecture executes instructions in a series of shorter cycles. Simpler instructions execute in fewer cycles than complicated ones. Moreover, the multicycle microarchitecture reduces the hardware cost by reusing expensive hardware blocks such as adders and memories. For example, the adder may be used on several different cycles for several purposes while carrying out a single instruction. The multicycle processor accomplishes this by adding several nonarchitectural registers to hold intermediate results. The multicycle processor executes only one instruction at a time, but each instruction takes multiple clock cycles.

The pipelined microarchitecture applies to pipelining to the single-cycle microarchitecture. It therefore can execute several instructions simultaneously, improving the throughput significantly. Pipelining must add logic to handle dependencies between simultaneously executing instructions. It also requires nonarchitectural pipeline registers. The added logic and registers are worthwhile; all commercial high-performance processors use pipelining today.

Performance Analysis

A particular processor architecture can have many microarchitectures with different cost and performance trade-offs. The cost depends on the amount of hardware required and the implementation technology. Each year, CMOS processes can pack more transistors on a chip for the same amount of money, and processors take advantage of these additional transistors to deliver more performance. Precise cost calculations require detailed knowledge of the implementation technology, but in general, more gates and more memory mean more dollars.

There are many ways to measure the performance of a computer system, and marketing departments are infamous for choosing the method that makes their computer look fastest, regardless of whether the measurement has any correlation to real world performance.

The only gimmick-free way to measure performance is by measuring the execution time of a program of interest to you. The computer that executes your program fastest has the highest performance. The next best choice is to measure the total execution time of a collection of programs that are similar to those you plan to run; this may be necessary if you haven’t written your program yet or if somebody else who doesn’t have your program is making the measurements. Such collections of programs are called benchmarks, and the execution times of these programs are commonly published to give some indication of how a processor performs.

The execution time of a program, measured in seconds, is given by below equation:

$Execution Time = (# instructions) (\frac{cycles}{instruction}) (\frac{seconds}{seconds})$

The number of instructions in a program depends on the processor architecture. Some architectures have complicated instructions that do more work per instruction, thus reducing the number of instructions in a program. However, these complicated instructions are often slower to execute in hardware. The number of instructions also depends enormously on the cleverness of the programmer.

The number of cycles per instruction, often called CPI, is the number of clock cycles required to execute an average instruction. It is the reciprocal of the throughput (instructions per cycle, or IPC). Different microarchitectures have different CPIs. In these pages we will assume we have an ideal memory system that does not affect the CPI.

The number of seconds per cycle is the clock period, $T_{c}$ . The clock period is determined by the critical path through the logic on the processor. Different microarchitectures have different clock periods. Logic and circuit designs also significantly affect the clock period.

The challenge of the microarchitect is to choose the design that minimizes the execution time while satisfying constraints on cost and/or power consumption. Because microarchitectural decisions affect both CPI and $T_{c}$ and are influenced by logic and circuit designs, determining the best choice requires careful analysis. There are many other factors that affect overall computer performance. For example, the hard disk, the memory, the graphics system, and the network connection may be limiting factors that make processor performance irrelevant.

Single-Cycle Processor

We first design a MIPS microarchitecture that executes instructions in a single cycle. We begin constructing the datapath by connecting the state elements from the above figure with combinational logic that can execute the various instructions. Control signals determine which specific instruction is carried out by the datapath at any given time. The controller contains combinational logic that generates the appropriate control signals based on the current instruction. We conclude by analyzing the performance of the single-cycle processor.

Single-Cycle Datapath

This section gradually develops the single-cycle datapath, adding one piece at a time to the state elements from the above figure. The new connections are emphasized in black (or blue, for new control signals), while the hardware that has already been studies is shown in gray.

The program counter register contains the address of the instruction to execute. The first step is to read this instruction from instruction memory. The figure shows that the PC is simply connected to the address input of the instruction memory. The instruction memory reads out, or fetches, the 32-bit instruction, labeled Instr.

The processor’s actions depend on the specific instruction that was fetched, First we will work out the datapath connections for the lw instruction. The we will consider how to generalize the datapath to handle the other instructions.

Untitled

For a lw instruction, the next step is to read the source register containing the base address. This register is specified in the rs field of the instruction, $Instr_{25 : 21}$ . These bits of the instruction are connected to the address input of one of the register file read ports, $A 1$ , as shown in the figure. The register file reads the register value onto $RD 1$ .

Untitled

The lw instruction also requires an offset. The offset is stored in the immediate field of the instruction, $Instr_{15 : 0}$ . Because the 16-bit immediate might be either positive or negative, it must be sign-extended to 32 bits, as shown in the figure. The 32-bit sign-extended value is called SignImm. Recall that sign extension simply copies the sign bit of a short input into all of the upper bits of the longer output.

Untitled

The processor must add the base address to the offset to find the address to read from memory. The figure introduces an ALU to perform this addition. The ALU receives two operands, SrcA and SrcB. SrcA comes from the register file, and SrcB comes from the sign-extended immediate. The ALU can perform many operations. The 3-bit ALUControl signal specifies the operation. The ALU generates a 32-bit ALUResult and a Zero flag, that indicates whether ALUResult == 0. For a lw instruction, the ALUControl signal should be set to 010 to add the base address and offset. ALUResult is sent to the data mmemory as the address for the load instruction, as shown.

Untitled

The data is read from the data memory onto the ReadData bus, then written back to the destination register in the register file at the end of the cycle, as shown. Port 3 of the register file is the write port. The destination register for the lw instruction is specified in the rt field, $Instr_{20 : 16}$ , which is connected to the port 3 address input, $A 3$ , of the register file. The ReadData bus is connected to the port 3 write data input, $WD 3$ , of the register file. A control signal called RegWrite is connected to the port 3 write enable, $WE 3$ , and is asserted during a lw instruction so that the data value is written into the register file. The write takes place on the rising edge of the clock at the end of the cycle.

Untitled

While the instruction is being executed, the processor must compute the address of the next instruction, PC’. Because instructions are 32 bits = 4 bytes, the next instruction is at PC + 4. The figure uses another adder to increment the PC by 4. The new address is written into the program counter on the next rising edge of the clock. This completes the datapath for the lw instruction.

Untitled

Next, let us extend the datapath to also handle the sw instruction. Like the lw instruction, the sw instruction reads a base address from port 1 of the register and sign-extends an immediate. The ALU adds the base address to the immediate to find the memory address. All of these functions are already supported by the datapath.

The sw instruction also reads a second register from the register file and writes it to the data memory. The figure shows the new connections for this function. The register is specified in the rt field, $Instr_{20 : 16}$ . These bits of the instruction are connected to the second register file read port, $A 2$ . The register value is read onto the $RD 2$ port. It is connected to the write data port of the data memory. The write enable port of the data memory, $WE$ , is controlled by MemWrite. For a sw instruction, MemWrite = 1, to write the data to memory, ALUControl = 010, to add the base address and offset; and RegWrite = 0, because nothing should be written to the register file. Note that data is still read from the address given to the data memory, but that this ReadData is ignored because RegWrite = 0.

Next, consider extending the datapath to handle the R-type instructions add, sub, and, or, and slt. All of these instructions read two registers from the register file, perform some ALU operation on them, and write the result back to a third register file. They differ only in the specific ALU operation. Hence, they can all be handled with the same hardware, using different ALUControl signals.

Untitled

The figure shows the enhanced datapath handling R-type instructions. The register file reads two registers. The ALU performs an operation on these two registers. Before, the ALU always received its SrcB operand from the sign-extended immediate. Now, we add a multiplexer to choose SrcB from either the register file $RD 2$ port or SignImm.

The multiplexer is controlled by a new signal, ALUSrc. ALUSrc is 0 for R-type instructions to choose SrcB from the register file; it is 1 for lw and sw to choose SignImm. This principle of enhancing the datapath’s capabilities by adding a multiplexer to choose inputs from several possibilities is extremely useful. Indeed, we will apply it twice more to complete the handling of R-type instructions.

Before, the register file always got its write data from the data memory. However, R-type instructions write the ALUResult to the register file. Therefore, we add another multiplexer to choose between ReadData and ALUResult. We call its output Result. This multiplexer is controlled by another new signal, MemtoReg. MemtoReg is 0 for R-type instructions to choose Result from the ALUResult; it is 1 for lw to choose ReadData. We don’t care about the value of MemtoReg for sw, because sw does not write to the register file.

Similarly, before, the register to write was specified by the rt field of the instruction, $Instr_{20 : 16}$ . However, for R-type instructions, the register is specified by the rd field, $Instr_{15 : 11}$ . Thus, we add a third multiplexer to choose WriteReg from the appropriate field of the instruction. The multiplexer is controlled by RegDst. RegDst is 1 for R-type instructions to choose WriteReg from the rd field, $Instr_{15 : 11}$ ; it is 0 for lw to choose ReadData. We don’t care about the value of MemtoReg for sw, because sw does not write to the register file.

Finally, let us extend the datapath to handle beq. beq compares two registers. If they are equal, it takes the branch by adding the branch offset to the program counter. Recall that the offset is a positive or negative number, stored in the imm field of the instruction, $Instr_{31 : 26}$ . The offset indicates the number of instructions to branch past. Hence, the immediate must be sign-extended and multiplied by 4 to get the new program counter value: PC’ = PC + 4 + SignImm $\times$ 4.

Untitled

The figure shows the datapath modifications. The next PC value for a taken branch, PCBranch, is computed by shifting SignImm left by 2 bits, then adding it to PCPlus4. The left shift by 2 is an easy way to multiply by 4, because a shift by a constant amount involves just wires. The two registers are compared by computing SrcA - SrcB using the ALU. If ALUResult is 0, as indicated by the Zero flag from the ALU, the registers are equal. We add a multiplexer to choose PC’ from either PCPlus4, or PCBranch. PCBranch is selected if the instruction is a branch and the Zero flag is asserted. Hence, Branch is 1 for beq and 0 for other instructions. For beq, ALUControl = 110, so the ALU performs a subtraction. ALUSrc = 0 to choose SrcB from the register file. RegWrite and MemWrite are 0, because a branch does not write to the register file or memory. We don’t care about the values of RegDst and MemtoReg, because the register file is not written.

This completes the design of the single-cycle MIPS processor datapath. We have illustrated not only the design, but the design process in which the state elements are identified and the combinational logic connecting the state elements is systematically added.

Untitled

Single-Cycle Control

The control unit computes the control signals based on the opcode and funct fields of the instruction, $Instr_{31 : 26}$ and $Instr_{5 : 0}$ . The figure shows the entire single-cycle MIPS processor with the control unit attached to the datapath.

Untitled

Most of the control information comes from the opcode, but R-type instructions also use the funct field to determine the ALU operation. Thus, we will simplify our design by factoring the control unit into two blocks of combinational logic, as shown in the figure. The main decoder computes most of the outputs from the opcode. It also determines a 2-bit ALUOp signal. The ALU decoder uses this ALUOp signal in conjunction with the funct field to compute ALUControl.

Untitled

The meaning of the ALUOp signal is given in the table.

Untitled

The table is a truth table for the ALU decoder. Recall that the meaning of the three ALUControl signals were given earlier, when we designed the ALU. Because ALUOp is never 11, the truth table can use don’t care’s X1 and 1X instead of 01 and 10 to simplify the logic. When ALUOp is 00 or 01, the ALU should add or subtract, respectively. When ALUOp is 10, the decoder examines the funct field to determine the ALUControl. Note that, for the R-type instructions we implement, the first two bits of the funct field are always 10, so we may ignore them to simplify the decoder.

Untitled

The control signals for each instruction were described as we built the datapath. The table is a truth table for the main decoder that summarizes the control signals as a function of the opcode. All R-type instructions use the same main decoder values; they differ only in the ALU decoder output. Recall that, for instructions that do not write to the register file, the RegDst and MemtoReg control signals are don’t cares; the address and data to the register write port do not matter because RegWrite is not asserted.

Untitled

Performance Analysis

Each instruction in the single-cycle processor takes one clock cycle, so the CPI is 1. The critical path for the lw instruction is shown in the figure with a heavy dashed blue line. It starts with the PC loading a new address on the rising edge of the clock. The instruction memory reads the next instruction. The register file reads SrcA. While the register file is reading, the immediate field is sign-extended and selected at the ALUSrc multiplexer to determine SrcB. The ALU adds SrcA and SrcB to find the effective address. The data memory reads from this address. The MemtoReg multiplexer selects ReadData. Finally, Result must setup at the register file before the next rising clock edge, so that it can be properly written. Hence, the cycle time is

$T_{c} = T_{p c q_PC} + t_{mem} + max (t_{R F_{read}}, t_{sign-ext}) + t_{mux} + t_{ALU} + t_{mem} + t_{mux} + t_{RF setup}$

In most implementation technologies, the ALU, memory, and register file accesses are substantially slower than other operations. Therefore, the cycle time simplifies to

$T_{c} = t_{p c q_PC} + 2 t_{mem} + t_{RF read} + 2 t_{mux} + t_{ALU} + t_{RF setup}$

The numerical values of these times will depend on the specific implementation technology. Other instructions have shorter critical paths. For example, R-type instructions do not need to access data memory. However, we are disciplining ourselves to synchronous sequential design, so the clock period is constant and must be long enough to accommodate the slowest instruction.

Untitled

Multicycle Processor

The single-cycle processor has three primary weaknesses. First, it requires a clock cycle long enough to support the slowest instruction, even though most instructions are faster. Second, it requires three adders; adders are relatively expensive circuits, especially if they must be fast. And third, it has separate instruction and data memories, which may not be realistic. Most computers have a single large memory that holds both instructions and data and that can be read and written.

The multicycle processor addresses these weaknesses by breaking an instruction into multiple shorter steps. In each short step, the processor can read or write the memory or register file or use the ALU. Different instructions use different numbers of steps, so simpler instructions can complete faster than more complex ones. The processor needs only one adder; this adder is reused for different purposes on various steps. And the processor uses a combined memory for instructions and data. The instruction is fetched from memory on the first step, and data may be read or written on later steps.

We design a multicycle processor following the same procedure we used for the single-cycle processor. First, we construct a datapath by connecting the architectural state elements and memories with combinational logic. But, this time, we also add nonarchitectural state elements to hold intermediate results between the steps. Then we design the controller. The controller produces different signals on different steps during execution of a single instruction, so it is now a finite state machine rather than combinational logic. We again examine how to add new instructions to the processor. Finally, we analyze the performance of the multicycle processor and compare it to the single-cycle processor.

Multicycle Datapath

Again, we begin our design with the memory and architectural state of the MIPS processor, as shown in the figure. In the single-cycle design, we used separate instruction and data memories because we needed to read the instruction memory and read or write the data memory all in one cycle. Now, we choose to use a combined memory for both instructions and data. This is more realistic, and it is feasible because we can read the instruction in one cycle, then read or write the data in a separate cycle. The PC and register file remain unchanged. We gradually build the datapath by adding components to handle each step of each instruction. The new connections are emphasized in black (or blue for new control signals), whereas the hardware that has already been studied is shown in gray.

Untitled

The PC contains the address of the instruction to execute. The first step is to read this instruction from instruction memory. The figure shows that the PC is simply connected to the address input of the instruction memory. The instruction is read and stored in a new nonarchitectural Instruction Register so that it is available for future cycles. The Instruction Regiter receives an enable signal, called IRWrite, that is asserted when it should be updated with a new instruction.

Untitled

As we did with the single-cycle processor, we will work out the datapath connections for the lw instruction. Then we will enhance the datapath to handle the other instructions. For a lw instruction, the next step is to read the source register containing the base address. This register is specified in the rs field of the instruction, $Instr_{25 : 21}$ . These bits of the instruction are connected to one of the address inputs, $A 1$ , of the register file, as shown in the figure. The register file reads the register onto $RD 1$ . This value is stored in another nonarchitectural register, $A$ .

Untitled

The lw instruction also requires an offset. The offset is stored in the immediate field of the instruction, $Instr_{15 : 0}$ and must be sign-extended to 32-bits, as shown. The 32-bit sign-extended value is called SignImm. To be consistent, we might store SignImm in another nonarchitecural register. However, SignImm is a combinational function of Instr and will not change while the current instruction is being processed, so there is no need to dedicate a register to hold the constant value.

Untitled

The address of the load is the sum of the base address and offset. We use an ALU to compute this sum, as shown in the figure. ALUControl should be set to 010 to perform an addition. ALUResult is stored in a nonarchitectural register called ALUOut.

Untitled

The next step is to load the data from the calculated address in the memory. We add a multiplexer in front of the memory to choose the memory address, Adr, from either the PC or ALUOut, as shown in the figure. The multiplexer select signal is called IorD, to indicate either an instruction or data address. The data read from the memory is stored in another nonarchitectural register called Data. Notice that the address multiplexer permits the reuse of the memory during the lw instruction. On the first step, the address is taken from ALUOut to load the data. Hence, IorD must have different values on different steps. We develop the FSM controller that generates these sequences of control signals later.

Untitled

Finally, the data is written back to the register file, as shown in the figure. The destination register is specified by the rt field of the instruction, $Instr_{20 : 16}$ .

Untitled

While all this is happening, the processor must update the program counter by adding 4 to the old PC. In the single-cycle processor, a separate adder was needed. In the multicycle processor, we can use the existing ALU on one of the steps when it is not busy. To do so, we must insert source multiplexers to choose the PC and the constant 4 as ALU inputs, as shown in the figure. A two-input multiplexer controlled by ALUSrcA chooses either the PC or register $A$ as SrcA. A four-input multiplexer controlled by ALUSrcB chooses either 4 or SignImm as SrcB. We use the other two multiplexer inputs later when we extend the datapath to handle other instructions. To update the PC, the ALU adds SrcA (PC) to SrcB (4), and the result is written back into the program counter register. The PCWrite control signal enables the PC register to be written only on certain cycles.

This completes the datapath for the lw instruction. Next, let’s extend the datapath to also handle the sw instruction. Like the lw instruction, the sw instruction reads a base address from port 1 of the register file and sign-extends the immediate. The ALU adds the base address to the immediate to find the memory address.

Untitled

The only new feature of sw is that we must read a second register from the register file and write it into the memory, as shown in the figure. The register is specified in the rt field of the instruction, $Instr_{20 : 16}$ , which is connected to the second port of the register file. When the register is read, it is stored in a nonarchitectural register, $B$ . On the next step, it is sent to the write data port ( $WD$ ) of the data memory to be written. The memory receives an additional MemWrite control signal to indicate that the write should occur.

Untitled

For R-type instructions, the instruction is again fetched, and the two source registers are read from the register file. Another input of the SrcB multiplexer is used to choose register $B$ as the second source register for the ALU, as shown. The ALU performs the appropriate operation and stores the result in ALUOut. On the next step, ALUOut is written back to the register specified by the rd field of the instruction, $Instr_{15 : 11}$ . This requires two new multiplexers. The MemtoReg multiplexer selects whether $WD 3$ comes from ALUOut (for R-type instructions) or from Data (for lw). The RegDst instruction selects whether the destination register is specified in the rt or rd field of the instruction.

Untitled

For the beq instruction, the instruction is again fetched, and the two source registers are read from the register file. To determine whether the registers are equal, the ALU subtracts the registers and examines the Zero flag. Meanwhile, the datapath must compute the next value of the PC if the branch is taken: PC’ = PC + 4 + SignImm $\times$ 4. In the single-cycle processor, yet another adder was needed to compute the branch address. In the multicycle processor, the ALU can be reused again to save hardware. On one step, the ALU computers PC + 4 and writes it back to the program counter, as was done for other instructions. On another step, the ALU uses this updated PC value to compute PC + SignImm $\times$ 4. SignImm is left-shifted by 2 to multiply it by 4, as shown. The SrcB multiplexer chooses this value and add it to the PC. This sum represents the destination of the branch and is stored in ALUOut. A new multiplexer, controlled by PCSrc, chooses what signal should be sent to PC’. The program counter should be written either when PCWrite is asserted or when a branch is taken. A new control signal, Branch, indicates that the beq instruction is being executed. The branch is taken if Zero is also asserted. Hence, the datapath computes a new PC write enable, called PCEn, which is TRUE either when PCWrite is asserted or when both Branch and Zero are asserted.

This completes the design of the multicycle MIPS processor datapath. The design process is much like that of the single-cycle processor in that hardware is systematically connected between the state elements to handle each instruction. The main difference is that the instruction is executed in several steps. Nonarchitectural registers are inserted to hold the results of each step. In this way, the ALU can be reused several times, saving the cost of extra adders. Similarly, the instructions and data can be stored in one shared memory.

Untitled

Multicycle Control

As in the single-cycle processor, the control unit computes the control signals based on the opcode and funct fields of the instruction, $Instr_{31 : 26}$ and $Instr_{5 : 0}$ . The figure shows the entire multicycle MIPS processor with the control unit attached to the datapath. The datapath is shown in black, and the control unit is shown in blue.

Untitled

As in the single-cycle processor, the control unit is partitioned into a main controller and an ALU decoder, as shown. The ALU decoder is unchanged and follows the truth table from the single-cycle control unit. Now however, the main controller is an FSM that applies the proper control signals on the proper cycles or steps. The sequence of control signals depends on the instruction being executed.

The main controller produces multiplexer select and register enable signals for the datapath. The select signals are MemtoReg, RegDst, IorD, PCSrc, ALUSrcA, and ALUSrcB. The enable signals are IRWrite, MemWrite, PCWrite, Branch, and RegWrite.

To keep the following state transition diagrams readable, only the relevant control signals are listed. Select signals are listed only when their value matters; otherwise, they are don’t cares. Enable signals are listed only when they are asserted; otherwise, they are 0.

Untitled

The first step for any instruction is to fetch the instruction from memory at the address held in the PC. The FSM enters this state on reset. To read memory, IorD = 0, so the address is taken from the PC. IRWrite is asserted to write the instruction into the instruction register, IR. Meanwhile, the PC should be incremented by 4 to point to the next instruction. Because the ALU is not being used for anything else, the processor can use it to compute PC + 4 at the same time that it fetches the instruction. ALUSrcA = 0, so SrcA comes from the PC. ALUSrcB = 01, so SrcB is the constant 4. ALUOp = 00, so the ALU decoder produces ALUControl = 010 to make the ALU add. To update the PC with this new value, PCSrc = 0, and PCWrite is asserted. These control signals are shown in the figure. The data flow on this step is shown below, with the instruction fetch shown using the dashed blue line and the PC increment shown using the dashed gray line.

Untitled

The next step is to read the register file and decode the instruction. The register file always reads the two sources specified by the rs and rt fields of the instruction. Meanwhile, the immediate is sign-extended. Decoding involves examining the opcode of the instruction to determine what to do next. No control signals are necessary to decode the instruction, but the FSM must wait 1 cycle for the reading and decoding to complete, as shown. The new state is highlighted in blue. The data flow is shown below.

Untitled

Now the FSM proceeds to one of several possible states, depending on the opcode. If the instruction is a memory load or store, the multicycle processor computes the address by adding the base address to the sign-extended immediate. This requires ALUSrcA = 1 to select register $A$ and ALUSrcB = 10 to select SignImm. ALUOp = 00, so the ALU adds. The effective address is stored in the ALUOut register for use on the next step. This FSM step is shown in the figure, and the data flow is shown below.

Untitled

If the instruction is lw, the multicycle processor must next read data from memory and write it to the register file. Thee two steps are shown in the figure. To read from memory, IorD = 1 to select the memory address that was just computed and saved in ALUOut. This address in memory is read and saved in the Data register during step S3. On the next step, S4, Data is written to the register file. MemtoReg = 1 to select Data and RegDst = 0 to pull the destination register from the rt field of the instruction. RegWrite is asserted to perform the write, completing the lw instruction. Finally, the FSM returns to the initial state, S0, to fetch the next instruction.

Untitled

From state S2, if the instruction is sw, the data read from the second port of the register file is simply written to memory. IorD = 1 to select the address computed in S2 and saved in ALUOut. MemWrite is asserted to write the memory. Again, the FSM returns to S0 to fetch the next instruction. The added step is shown in the figure.

Untitled

If the opcode indicates an R-type instruction, the multicycle processor must calculate the result using the ALU and write that result to the register file. The figure shows these two steps. In S6, the instruction is executed by selecting the $A$ and $B$ registers and performing the ALU operation indicated by the funct field of the instruction. ALUOp = 10 for all R-type instructions. The ALUResult is stored in ALUOut. In S7, ALUOut is written to the register file, RegDst = 1, because the destination register is specified in the rd field of the instruction. MemtoReg = 0 because the write data, $WD 3$ , comes from ALUOut. RegWrite is asserted to write the register file.

Untitled

For a beq instruction, the processor must calculate the destination address and compare the two source registers to determine whether the branch should be taken. This requires two uses of the ALU and hence might see to demand two new states. Notice however, that the ALU was not used during S1 when the registers were being read. The processor might as well use the ALU at that time to compute the destination address by adding the incremented PC, PC + 4, to SignImm $\times$ 4, as shown. ALUSrcA = 0 to select the incremented PC, ALUSrcB = 11 to select SignImm $\times$ 4, and ALUOp = 00 to add. The destination address is stored in ALUOut. If the instruction is not beq, the computed address will not be used in subsequent cycles, but its computation was harmless. In S8, the processor compares the two registers by subtracting them and checking to determine whether the result is 0. If it is, the processor branches to the address that was just computed. ALUSrcA = 1 to select register $A$ ; ALUSrcB = 00 to select register $B$ ; ALUOp = 01 to subtract; PCSrc = 1 to take the destination address from ALUOut, and Branch = 1 to update the PC with this address if the ALU result is 0.

Untitled

Putting these steps together, the figure shows the complete main controller state transition diagram for the multicycle processor. Converting it to hardware is a straightforward but tedious task using the techniques in Sequential Logic. An easier way however is to code the FSM in Verilog and synthesize it using the techniques in Hardware Description Languages.

Untitled

Performance Analysis

The execution time of an instruction depends on both the number of cycles it uses and the cycle time. Whereas the single-cycle processor performed all instructions in one cycle, the multicycle processor uses varying numbers of cycles for the various instructions. However, the multicycle processor does less work in a single cycle and, thus, has a shorter cycle time.

The multicycle processor requires three cycles for beq instructions, four cycles for sw, and R-type instructions, and five cycles for lw instructions. The CPI depends on the relative likelihood that each instruction is used.

Recall that we designed the multicycle processor so that each cycle involved one ALU operation, memory access, or register file access. Let us assume that the register file is faster than the memory and that writing memory is faster than reading memory. Examining the datapath reveals two possible critical datapaths that would limit the cycle time:

$T_{c} = t_{p c q} + t_{mux} + max (t_{ALU} + t_{mux}, t_{mem}) + t_{setup}$

The numerical values of these times will depend on the specific implementation technology.

Pipelined Processor

Pipelining is a powerful way to improve the throughput of a digital system. We design a pipelined processor by subdividing the single-cycle processor into five pipeline stages. Thus, five instruction can execute simultaneously, one in each stage. Because each stage has only one-fifth of the entire logic, the clock frequency is almost five times faster. Hence, the latency of each instruction is ideally unchanged, but the throughput is ideally five times better. Microprocessors execute millions or billions of instructions per second, so throughput is more important than latency. Pipelining introduces some overhead, so the throughput will not be quite as high ass we might ideally desire, but pipelining nevertheless gives such great advantage for so little cost that all modern high-performance microprocessors are pipelined.

Reading and writing the memory and register file and using the ALU typically constitute the biggest delays in the processor. We choose five pipeline stages so that each stage involves exactly one of these slow steps. Specifically, we call the five stages Fetch, Decode, Execute, Memory, and Writeback. They are similar to the five steps that the multicycle processor used to perform lw. In the Fetch stage, the processor reads the instruction to produce the control signals. In the Execute stage, the processor performs a computation with the ALU. In the Memory stage, the processor reads or writes data memory. Finally, in the Writeback stage, the processor writes the result to the register file, when applicable.

The figure shows a timing diagram comparing the single-cycle and pipelined processors. Time is on the horizontal axis, and instructions are on the vertical axis. In the single-cycle processor, a), the first instruction is read from memory at time 0; next the operands are read from the register file; and then the ALU executes the necessary computation. Finally, the data memory may be accessed, and the result is written back to the register file by 950 ps. The second instruction begins when the first completes. Hence, in this diagram, the single-cycle processor has an instruction latency of 250 + 150 + 200 + 250 + 100 = 950 ps and a throughput of 1 instruction per 950 ps.

In the pipelined processor, b), the length of a pipeline stage is set at 250 ps by the slowest stage, the memory access (in the Fetch or Memory stage). At time 0, the first instruction is fetched from memory. At 250 ps, the first instruction enters the Decode stage, and a second instruction is fetched. At 500 ps, the first instruction executes, the second instruction enters the Decode stage, and a third instruction is fetched. And so forth, until all the instructions complete. The instruction latency is 5 $\times$ 250 ps = 1250 ps. The throughput is 1 instruction per 250 ps. Because the stages are not perfectly balanced with equal amounts of logic, the latency is slightly longer for the pipelined than for the single-cycle processor. Simmilarly, the throughput is not quite five times as great for a five-stage pipeline as for the single-cycle processor. Nevertheless, the throughput advantage is substantial.

Untitled

The figure shows an abstracted view of the pipeline in operation in which each stage is represented pictorially. Each pipeline stage is represented with its major component - instruction memory (IM), register file (RF) read, ALU execution, data memory (DM), and register file writeback - to illustrate the flow of instructions through the pipeline. Reading across a row shows the clock cycles in which a particular instruction is in each stage. For example, the sub instruction is fetched in cycle 3 and executed in cycle 5. Reading down a column shows what the various pipeline stages are doing on a particular cycle. For example, in cycle 6, the or instruction is being fetched from instruction memory, while $s1 is being read from the register file, the ALU is computing $t5 $AND$ $t6, the data memory is idle, and the register file is writing a sum to $s3. Stages are shaded to indicate when they are used. For example, the data memory is used by lw in cycle 4 and by sw in cycle 8. The instruction memory and ALU are used in every cycle. The register file is written by every instruction except sw. We assume that in the pipelined processor the register file is written in the first part of a cycle and read in the second part, as suggested by the shading. This way, data can be written and read back in a single cycle.

A central challenge in pipelined systems is handling hazards that occur when the results of one instruction are needed by a subsequent instruction before the former instruction has completed. For example, if the add in the figure used $s2 instead of $t2, a hazard would occur because the $s2 register has not been written by the lw by the time it is read by the add. This section explores forwarding, stalls, and flushes as methods to resolve hazards.

Untitled

Pipelined Datapath

The pipelined datapath is formed by chopping the single-cycle datapath into five stages separated by pipeline registers. The figure a) shows the single-cycle datapath stretched out to leave room for the pipeline registers. b) shows the pipelined datapath formed by inserting four pipeline registers to separate the datapath into five stages. The stages and their boundaries are indicated in blue. Signals are given a suffix (F, D, E, M, or W) to indicate the stage in which they reside.

The register file is peculiar because it is read in the Decode stage and written in the Writeback stage. This feedback will lead to pipeline hazards, which will be discussed later.

One of the subtle but critical issues in pipelining is that all signals associated with a particular instruction must advance through the pipeline in unison. Figure b) has an error related to this issue.

The error is in the register file write logic, which should operate in the Writeback stage. The data value comes from ResultW, a Writeback stage signal. But the address comes from WriteRegE, an execute stage signal. In the pipeline diagram above, during cycle 5, the result of the lw instruction would be incorrectly written to register $s4 rather than register $s2.

Untitled

The figure shows a corrected datapath. The WriteReg signal is now pipelined along through the Memory and Writeback stages, so it remains in sync with the rest of the instruction. WriteRegW and ResultW are fed back together to the register file in the Writeback stage.

The PC’ logic is also problematic, because it might be updated with a Fetch or a Memory stage signal (PCPlus4F or PCBranchM). This control hazard will be fixed later.

Untitled

Pipelined Control

The pipelined processor takes the same control signals as the single-cycle processor and therefore uses the same control unit. The control unit examines the opcode and funct fields of the instruction in the Decode stage to produce the control signals, as was described earlier. These control signals must be pipelined along with the data so that they remain synchronized with the instruction.

The entire pipelined processor with control is shown in the figure. RegWrite must be pipelined into the Writeback stage before it feeds back to the register file, just as WriteReg was pipelined above.

Untitled

Hazards

In a pipelined system, multiple instructions are handled concurrently. When one instruction is dependent on the results of another that has not yet completed, a hazard occurs.

The register file can be read and written in the same cycle. Let us assume that the write takes place during the first half of the cycle and the read take place during the second half of the cycle, so that a register can be written and read back in the same cycle without introducing a hazard.

The figure illustrates hazards that occur when one instruction writes a register ($s0) and subsequent instructions read this register. This is called a read after write (RAW) hazard. The add instruction writes a result into $s0 in the first half of cycle 5. However, the and instruction reads $s0 on cycle 3, obtaining the wrong value. The or instruction reads $s0 on cycle 4, again obtaining the wrong value. The sub instruction reads $s0 in the second half of cycle 5, obtaining the correct value, which was written in the first half of cycle 5. Subsequent instructions also read the correct value of $s0. The diagram shows that hazards may occur in this pipeline when an instruction writes a register and either of the two subsequent instructions read that register. Without special treadment, the pipeline will compute the wrong result.

On closer inspection, however, observe that the sum from the add instruction is computed by the ALU in cycle 3 and is not strictly needed by the and instruction until the ALU uses it in cycle 4. In principle, we should be able to forward the result from one instruction to the next to resolve the RAW hazard without slowing down the pipeline. In other situations explored later in this section, we may have to stall the pipeline to give time for a result to be computed before the subsequent instruction uses the result. In any event, something must be done to solve hazards so that the program executes correctly despite the pipelining.

Hazards are classified as data hazards or control hazards. A data hazard occurs when an instruction tries to read a register that has not yet been written back by a previous instruction. A control hazard occurs when the decision of what instruction to fetch next has not been made by the time the fetch takes place. In the remainder of this section, we will enhance the pipelined processor with a hazard unit that detects hazards and handles them appropriately, so that the processor executes the program correctly.

Untitled

Solving Data Hazards with Forwarding

Some data hazards can be solved by forwarding (also called bypassing) a result from the Memory or Writeback stage to a dependent instruction in the Execute stage. This requires adding multiplexers in front of the ALU to select the operand from either the register file or the Memory or Writeback stage. The figure illustrates this principle. In cycle 4, $s0 is forwarded from the memory stage of the add instruction to the Execute stage of the dependent and instruction. In cycle 5, $s0 is forwarded from the Writeback stage of the add instruction to the Execute stage of the dependent or instruction.

Untitled

Forwarding is necessary when an instruction in the Execute stage has a source register matching the destination register of an instruction in the Memory or Writeback stage. The figure modifies the pipelined processor to support forwarding. It adds a hazard detection unit and two forwarding multiplexers. The hazard detection unit receives the two source registers from the instruction in the Execute stage and the destination registers from the instructions in the Memory and Writeback stages. It also receives the RegWrite signals from the Memory and Writeback stages to know whether the destination register will actually be written. Note that the RegWrite signals are connected by name. In other words, rather than cluttering up the diagram with long wires running from the control signals at the top to the hazard unit at the bottom, the connections are indicated by a short stub of wire labeled with the control signal name to which it is connected.

The hazard detection unit computes control signals for the forwarding multiplexers to choose operands from the register file or from the results in the Memory or Writeback stage. It should forward from a stage if that stage will write a destination register and the destination register matches the source register. However, $s0 is hardwired to 0 and should never be forwarded. If both the memory and Writeback stages contain matching destination registers, the Memory stage should have priority because it contains the more recently executed instruction. In summary, the function of the forwarding logic for SrcA is given below. The forwarding logic for SrcB is identical except that it checks rt rather than rs.

if ((rsE != 0) AND (rsE == WriteRegM) AND RegWriteM) then
    ForwardAE = 10
else if ((rsE != 0) AND (rsE == WriteRegW) AND RegWriteW) then
    ForwardAE = 01
else ForwardAE = 00

Untitled

Solving Data Hazards With Stalls

Forwarding is sufficient to solve RAW data hazards when the result is computed in the Execute stage of an instruction, because its result can then be forwarded to the Execute stage of the next instruction. Unfortunately, the lw instruction does not finish reading data until the end of the Memory stage, so its result cannot be forwarded to the Execute stage of the next instruction. We say that the lw instruction has a two-cycle latency, because a dependent instruction cannot use its result until two cycles later. The figure shows this problem. The lw instruction receives data from memory at the end of cycle 4. But the and instruction needs that data as a source operand at the beginning of cycle 4. There is no way to solve this hazard with forwarding.

Untitled

The alternative solution is to stall the pipeline, holding up operation until the data is available. The figure shows stalling the dependent instruction (and) in the Decode stage. and enters the Decode stage in cycle 3 and stalls there through cycle 4. The subsequent instruction (or) must remain in the Fetch stage during both cycles as well, because the Decode stage is full.

In cycle 5, the result can be forwarded from the Writeback stage of lw to the Execute stage of and. In cycle 6, source $s0 of the or instruction is read directly from the register file, with no need for forwarding.

Notice that the Execute stage is unused in cycle 4. Likewise, Memory is unused in cycle 5 and Writeback is unused in cycle 6. This unused stage propagating through the pipeline is called a bubble, and it behaves like a nop instruction. The bubble is introduced by zeroing out the Execute stage control signals during a Decode stall so that the bubble performs no action and changes no architectural state.

In summary, stalling a stage is performed by disabling the pipeline register, so that the contents do not change. When a stage is stalled, all previous stages must also be stalled, so that no subsequent instructions are lost. The pipeline register directly after the stalled stage must be cleared to prevent bogus information from propagating forward. Stalls degrade performance, so they should only be used when necessary.

Untitled

The figure modifies the pipelined processor to add stalls for lw data dependencies. The hazard unit examines the instruction in the Execute stage. If it is lw and its destination register (rtE) matches either source operand of the instruction in the Decode stage (rsD or rtD), that instruction must be stalled in the Decode stage until the source operand is ready.

Stalls are supported by adding enable inputs to the Fetch and Decode pipeline registers and a synchronous reset/clear input to the Execute pipeline register. When a lw stall occurs, StallD and StallF are asserted to force the Decode and Fetch stage pipeline registers to hold their old values. FlushE is also asserted to clear the contents of the Execute stage pipeline register, introducing a bubble.

The MemtoReg signal is asserted for the lw instruction. Hence, the logic to compute the stalls and flushes is

lwstall = ((rsD == rtE) OR (rtD == rtE)) AND MemToRegE
StallF = StallD = FlushE = lwstall

Untitled

Solving Control Hazards

The beq instruction presents a control hazard: the pipelined processor does not know what instruction to fetch next, because the branch decision has not been made by the time the next instruction is fetched.

One mechanism for dealing with the control hazard is to stall the pipeline until the branch decision is made (i.e., PCSrc is computed). Because the decision is made in the Memory stage, the pipeline would have to be stalled for three cycles at every branch. This would severely degrade the system performance.

An alternative is to predict whether the branch will be taken and begin executing instructions based on the prediction. Once the branch decision is available, the processor can throw out the instructions if the prediction was wrong. In particular, suppose that we predict that branches are not taken and simply continue executing the program in order. If the branch should have been taken, the three instructions following the branch must be flushed (discarded) by clearing the pipeline registers for those instructions. These wasted instruction cycles are called the branch misprediction penalty.

The figure shows such a scheme, in which a branch from address 20 to address 64 is taken. The branch decision is not made until cycle 4, by which point the and, or, and sub instructions at addresses 24, 28 and 2C have already been fetched. These instructions must be flushed, and the slt instruction is fetched from address 64 in cycle 5. This is somewhat of an improvement, but flushing so many instructions when the branch is taken still degrades performance.

We could reduce the branch misprediction penalty if the branch decision could be made earlier. Making the decision simply requires comparing the values of two registers. Using a dedicated equality comparator is much faster than performing a subtraction and zero detection. If the comparator is fast enough, it could be moved back into the Decode stage, so that the operands are read from the register file and compared to determine the next PC by the end of the Decode stage.

Untitled

The figure shows the pipeline operation with the early branch decision being made in cycle 2. In cycle 3, the and instruction is flushed and the slt instruction is fetched. Now the branch misprediction penalty is reduced to only one instruction rather than three.

Untitled

The figure modifies the pipelined processor to move the branch decision earlier and handle control hazards. An equality comparator is added to the Decode stage and the PCSrc $AND$ gate is moved earlier, so that PCSrc can be determined in the Decoder stage rather than the Memory stage. The PCBranch adder must also be moved into the Decode stage so that the destination address can be computed in time. The synchronous clear input connected to PCSrcD is added to the Decode stage pipeline register so that the incorrectly fetched instruction can be flushed when a branch is taken.

Unfortunately, the early branch decision hardware introduces a new RAW data hazard. Specifically, if one of the source operands for the branch was computed by a previous instruction and has not yet been written into the register file, the branch will read the wrong operand value from the register file. As before, we can solve the data hazard by forwarding the correct value if it is available or stalling the pipeline until the data is ready.

Untitled

The figure shows the modifications to the pipelined processor needed to handle the Decode stage data dependency. If a result is in the Writeback stage, it will be written in the first half of the cycle and read during the second half, so no hazard exists. If the result of an ALU instruction is in the Memory stage, it can be forwarded to the equality comparator through two new multiplexers. If the result of an ALU instruction is in the Execute stage or the result of a lw instruction is in the Memory stage, the pipeline must be stalled at the Decode stage until the result is ready.

The function of the Decide stage forwarding logic is given below.

ForwardAD = (rsD != 0) AND (rsD == WriteRegM) AND RegWriteM
ForwardBD = (rtD != 0) AND (rtD == WriteRegM) AND RegWriteM

The function of the stall detection logic for a branch is given below.

The processor must make a branch decision in the Decode stage. If either of the sources of the branch depends on an ALU instruction in the Execute stage or on a lw instruction in the Memory stage, the processor must stall until all the sources are ready.

branchstall = (BranchD AND RegWriteE AND (WriteRegE == rsD OR WriteRegE == rtD)) OR (BranchD AND MemtoRegM AND (WriteRegM == rsD OR WriteRegM == rtD))

Now the processor might stall due to either a load or a branch hazard:

StallF = StallD = FlushE = lwstall OR branchstall

Untitled

Hazard Summary

In summary, RAW data hazards occur when an instruction depends on the result of another instruction that has not yet been written into the register file. The data hazards can be resolved by forwarding if the result is computed soon enough; otherwise they require stalling the pipeline until the result is available. Control hazards occur when the decision of what instruction to fetch has not been made by the time the next instruction must be fetched. Control hazards are solved by predicting which instruction should be fetched and flushing the pipeline if the prediction is later determined to be wrong. Moving the decision as early as possible minimizes the number of instructions that are flushed on a misprediction.

The figure show the complete pipelined processor handling all of the hazards.

Untitled

Performance Analysis

The pipelined processor ideally would have a CPI of 1, because a new instruction is issued every cycle. However, a stall or a flush wastes a cycle, so the CPI is slightly higher and depends on the specific program being executed.

We can determine the cycle time by considering the critical path in each of the five pipeline stages. Recall that the register file is written in the first half of the Writeback cycle and read in the second half of the decode cycle. Therefore, the cycle time of the Decode and Writeback stages is twice the time necessary to do the half-cycle of work.

Untitled

The pipelined processor is similar in hardware requirements to the single-cycle processor, but it adds a substantial number of pipeline registers, along with multiplexers and control logic to resolve hazards.

HDL Representation

This section presents HDL code for the single-cycle MIPS processor, supporting all of the instructions discussed above, including addi and j. The code illustrates good coding practices for a moderately complex system.

In this section, the instruction and data memories are separated from the main processor and connected by address and data busses. This is more realistic, because most real processors have external memory. it also illustrates how the processor can communicate with the outside world.

The processor is composed of a datapath and a controller. The controller, in turn, is composed of the main decoder and the ALU decoder. The figure shows a block diagram of the single-cycle MIPS processor interfaced to external memories. The HDL code is partitioned into several sections.

Untitled

Single-Cycle Processor

The main modules of the single-cycle MIPS processor module are given in the following Verilog examples.

Single-Cyle MIPS Processor

module mips (input         clk, reset,
             output [31:0] pc,
             input  [31:0] instr,
             output        memwrite,
             output [31:0] aluwrite, writedata,
             input  [31:0] readdata);
  wire memtoreg, branch,  alusrc, regdst, regwrite, jump;
  wire [2:0] alucontrol;
  controller c (instr[31:26], instr[5:0], zero, memtoreg, memwrite, pcsrc, alusrc, regdst, regwrite, jump, alucontrol);
  datapath dp (clk, reset, memtoreg, pcsrc, alusrc, regdst, regwrite, jump, alucontrol, zero, pc, instr, aluout, writedata, readdata);
endmodule

Controller

module controller (input  [5:0] op, funct,
                   input        zero,
                   output       memtoreg, memwrite,
                   output       pcsrc, alusrc,
                   output       regdst, regwrite,
                   output       jump,
                   output [2:0] alucontrol);
  wire [1:0] aluop;
  wire branch;
  maindec md (op, memtoreg, memwrite, branch, alusrc, regdst, regwrite, jump, aluop);
  aludec ad (funct, aluop, alucontrol);
  assign pcsrc = branch & zero;
endmodule

Main Decoder

module maindec (input  [5:0] op,
                output       memtoreg, memwrite,
                output       branch, alusrc,
                output       regdst, regwrite,
                output       jump,
                output [1:0] aluop);
  reg [8:0] controls;
  assign {regwrite, regdst, alusrc, branch, memwrite, memtoreg, jump, aluop} = controls;
  always @ (*)
    case (op)
      6'b000000: controls <= 9b110000010; // R-type
      6'b100011: controls <= 9b101001000; // ls
      6'b101011: controls <= 9b001010000; // sw
      6'b000100: controls <= 9b000100001; // beq
      6'b001000: controls <= 9b101000000; // addi
      6'b000010: controls <= 9b000000100; // j
      default: controls <= 9bxxxxxxxxx;   // ???
    endcase
endmodule

ALU Decoder

module aludec (input      [5:0] funct,
               input      [1:0] aluop,
               output reg [2:0] alucontrol);
  always @ (*)
    case (aluop)
      2'b00: alucontrol <= 3'b010; // add
      2'b01: alucontrol <= 3'b110; // sub
      default: case (funct)        // R-type
          6'b100000: alucontrol <= 3'b010; // add
          6'b100010: alucontrol <= 3'b110; // sub
          6'b100100: alucontrol <= 3'b000; // and
          6'b100101: alucontrol <= 3'b001; // or
          6'b101010: alucontrol <= 3'b111; // slt
          default: alucontrol <= 3'bxxx;   // ???
        endcase
    endcase
endmodule

Datapath

module datapath (input         clk, reset,
                 input         memtoreg, pcsrc,
                 input         alusrc, regdst,
                 input         regwrite, jump,
                 input  [2:0]  alucontrol,
                 output        zero,
                 output [31:0] pc,
                 input  [31:0] instr,
                 output [31:0] aluout, writedata,
                 input  [31:0] readdata);
  wire [4:0] writereg;
  wire [31:0] pcnext, pcnextbr, pcplus4, pcbranch;
  wire [31:0] signimm, signimmsh;
  wire [31:0] srca, srcb;
  wire [31:0] result;
  // next PC logic
  flopr #(32) pcreg (clk, reset, pcnext, pc);
  adder pcadd1 (pc, 32'b100, pcplus4);
  sl2 immsh (signimm, signimmsh);
  adder pcadd2 (pcplus4, signimmsh, pcbranch);
  mux2 #(32) pcbrux (pcplus4, pcbranch, pcsrc, pcnextbr);
  mux2 #(32) pcmux (pcnextbr, {pcpluss4[31:28], instr[25:0], 2'b00}, jump, pcnext);
  // register file logic
  regfile rf (clk, regwrite, instr[25:21], instr[20:16], writereg, result, srca, writedata);
  mux2 #(5) wrmux (instr[20:16], instr[15:11], regdst, writereg);
  mux2 #(32) resmux (aluout, readdata, memtoreg, result);
  signext se (instr[15:0], signimm);
  // ALU logic
  mux2 #(32) srcbmux (writedata, signimm, alusrc, srcb);
  alu alu (srca, srcb, alucontrol, aluout, zero);
endmodule

Generic Building Blocks

This section contains generic building blocks that may be useful in any MIPS microarchitecture, including a register file, adder, left shift unit, sign-extension unit, resettable flip-flop, and multiplexer.

Register File

module regfile (input         clk,
                input         we3,
                input  [4:0]  ra1, ra2, wa3,
                input  [32:0] wd3,
                output [31:0] rd1, rd2);
  reg [31:0] rf[31:0];
  always @ (posedge clk)
    if (we3) rf[wa3] <= wd3;
  assign rd1 = (ra1 != 0) ? rf[ra1] : 0;
  assign rd2 = (ra2 != 0) ? rf[ra2] : 0;
endmodule

Adder

module adder (input  [31:0] a, b,
              output [31:0] y);
  assign y = a + b;
endmodule

Left Shift (Multiply by 4)

module sl2 (input  [31:0] a,
            output [31:0] y);
  // shift left by 2
  assign y = {a[29:0], 2'b00}
endmodule

Sign Extension

module signet (input  [15:0] a,
               output [31:0] y);
  assign y = , a};
endmodule

Resettable Flip-Flop

module flopr #(parameter WIDTH = 8)
              (input                  clk, reset,
               input      [WIDTH-1:0] d,
               output reg [WIDTH-1:0] q);
  always @ (posedge clk, posedge reset)
    if (reset) q <= 0;
    else q <= d;
endmodule

2:1 Multiplexer

module mux2 #(parameter WIDTH = 8)
             (input  [WIDTH-1:0] d0, d1,
              input              s,
              output [WIDTH-1:0] y);
  assign y = s ? d1 : d0;
endmodule

Testbench

The MIPS testbench loads a program into the memories. The program in the figure exercises all of the instructions by performing a computation that should produce the correct answer only if all of the instructions are functioning properly. Specifically, the program will write the value 7 to address 84 if it runs correctly, and is unlikely to do so if the hardware is buggy. This is an example of ad hoc testing.

The machine code is stored in a hexadecimal file called memfile.dat, which is loaded by the testbench during simulation. The file consists of the machine code for the instructions, one instruction per line.

Untitled

The testbench, top-level MIPS module, and external memory HDL code are given in the following examples. The memories in this example hold 64 words each.

MIPS Testbench

module testbench ();
  reg clk;
  reg reset;
  wire [31:0] writedata, dataadr;
  wire memwrite;
  // instantiate device to be tested
  top dut (clk, reset, writedata, dataadr, memwrite);
  // initialize test
  initial
    begin
      reset <= 1; # 22; reset <= 0;
    end
  // alway generate clock to sequence tests
  always
    begin
      clk <= 1; # 5; clk <= 0; # 5;
    end
  // check results
  always @ (negedge clk)
    begin
      if (memwrite) begin
        if (dataadr === 84 & writedata === 7) begin
          $display("Simulation succeeded");
          $stop;
        end else if (dataadr !== 80) begin
          $display("Simulation failed");
          $stop;
        end
      end
    end
endmodule

MIPS Top-Level Module

module top (input         clk, reset,
            output [31:0] writedata, dataadr,
            output        memwrite);
  wire [31:0] pc, instr, readdata;
  // instantiate processor and memories
  mips mips (clk, reset, pc, instr, memwrite, dataadr, writedata, readdata);
  imem imem (pc[7:2], instr);
  imem dmem (clk, memwrite, dataadr, writedata, readdata);
endmodule

MIPS Data Memory

module dmem (input         clk, we,
             input  [31:0] a, wd,
             output [31:0] rd);
  reg [31:0] RAM[63:0];
  assign rd = RAM[a[31:2]]; // word aligned
  always @ (posedge clk)
    if (we)
      RAM[a[31:2]] <= wd;
endmodule

MIPS Instruction Memory

module imem (input  [5:0]  a,
             output [31:0] rd);
  reg [31:0] RAM[63:0];
  initial
    begin
      $readmemh("memfile.dat", RAM);
    end
  assign rd = RAM[a]; // word aligned
endmodule

Exceptions

In this section we enhance the multicycle processor to support two types of exceptions: undefined instructions and arithmetic overflow. Supporting exceptions in other microarchitectures follows similar principles. As described previously, when an exception takes place the processor copies the PC to the EPC register and stores a code in the Cause register indicating the source of the exception. Exception causes include 0x28 for undefined instructions and 0x30 for overflow. The processor then jumps to the exception handler at memory address 0x80000180. The exception handler is code that responds to the exception. It is part of the operating system.

Also, the registers are part of Coprocessor 0, a portion of the MIPS processor that is sued for system functions. Coprocessor 0 defines up to 32 special-purpose registers, including Cause and EPC. The exception handler may use the mfc0 (move from coprocessor 0) instruction to copy these special-purpose registers into a general-purpose register in the register file; the Cause register is Coprocessor register 13, and EPC is register 14.

To handle exceptions, we must add EPC and Cause register to the datapath and extend the PCSrc multiplexer to accept the exception handler memory address, as shown. The two new registers have write enables, EPCWrite and CauseWrite, to store the PC and exception cause when an exception takes place. The cause is generated by a multiplexer that selects the appropriate code for the exception. The ALU must also generate an overflow signal, as was discussed earlier.

Untitled

To support the mfc0 instruction, we also add a way to select the Coprocessor 0 registers and write them to the register file, as shown. The mfc0 instruction specified the Coprocessor 0 register by $Instr_{15 : 11}$ ; in this diagram, only the Cause and EPC registers are supported. We add another input to the MemtoReg multiplexer to select the value from Coprocessor 0.

Untitled

The modified controller is shown in the figure. The controller receives the overflow flag from the ALU. It generates three new control signals; one to write the EPC, a second to write the Cause register, and a third to select the Cause. It also includes two new states to support the two exceptions and another state to handle mfc0.

If the controller receives an undefined instruction, it proceeds to S12, saves the PC in EPC, writes 0x28 to the Cause register, and jumps to the exception handler. Similarly, if the controller detects arithmetic overflow on an add or sub instruction, it proceeds to S13, saves the PC in EPC, writex 0x30 in the Cause register, and jumps to the exception handler. Note that, when an exception occurs, the instruction is discarded and the register file is not written. When a mfc0 instruction is decoded, the processor goes to S14 and writes the appropriate Coprocessor 0 register to the main register file.

Untitled

Advanced Microarchitecture

High-performance microprocessors use a wide variety of techniques to run programs faster. Recall that the time required to run a program is proportional to the period of the clock and to the number of clock cycles per instruction (CPI). Thus, to increase performance we would like to speed up the clock and/or reduce the CPI.

A manufacturing process is characterized by its feature size, which indicates the smallest transistor that can be reliably built. Smaller transistors are faster and generally consume less power. Thus, even if the microarchitecture does not change, the clock frequency can increase because all the gates are faster. Moreover, smaller transistors enable placing more transistors on a chip. Microarchitects use the additional transistors to build more complicated processors or to put more processors on a chip. Unfortunately, power consumption increases with the number of transistors and the speed at which they operate. Power consumption is now an essential concern. Microprocessor designers have a challenging task juggling the trade-offs among speed, power, and cost for chips with billions of transistors in some of the most complex systems that humans have ever built.

Deep Pipelines

Aside from advances in manufacturing, the easiest way to speed up the clock is to chop the pipeline into more stages. Each stage contains less logic, so it can run faster. This chapter has considered a classic five-stage pipeline, but 10 to 20 stages are now commonly used.

The maximum number of pipeline stages is limited by pipeline hazards. sequencing overhead, and cost. Longer pipelines introduce more dependencies. Some of the dependencies can be solved by forwarding, but others require stalls, which increase the CPI. The pipeline registers between each stage have sequencing overhead from their setup time and clock-to- $Q$ delay (as well as clock skew). This sequencing overhead makes adding more pipeline stages give diminishing returns. Finally, adding more stages increases the cost because of the extra pipeline registers and hardware required to handle hazards.

Untitled

Branch Prediction

An ideal pipelined processor would have a CPI of 1. The branch misprediction penalty is a major reason for increased CPI. As pipelines get deeper, branches are resolved later in the pipeline. Thus, the branch misprediction penalty gets larger, because all the instructions issued after the mispredicted branch must be flushed. To address this problem, most pipelined processors use a branch predictor to guess whether the branch should be taken. Our pipeline from before simply predicted that branches are never taken.

Some branches occur when a program reaches the end of a loop and branches back to repeat the loop. Loops tend to be executed many times, so these backward branches are usually taken. The simplest form of branch prediction checks the direction of the branch and predicts that backward branches should be taken. This is called static branch prediction, because it does not depend on the history of the program.

Forward branches are difficult to predict without knowing more about the specific program. Therefore, most processors use dynamic branch predictors, which use the history of program execution to guess whether a branch should be taken. Dynamic branch predictors maintain a table of the last several hundred (or thousand) branch instructions that the processor has executed. The table, sometimes called a branch target buffer, includes the destination of the branch and a history of whether the branch was taken.

To see the operation of dynamic branch predictors let us consider the following loop code.

  add $s1, $0, $0
  add $s0, $0, $0
  addi $t0, $0, 10 
for:
  beq $s0, $t0, done
  add $s1, $s1, $s0
  addi $s0, $s0, 1
  j for
done:

A one-bit dynamic branch predictor remembers whether the branch was taken the last time and predicts that it will do the same thing the next time. While the loop is repeating, it remembers that beq was not taken last time and predicts that it should not be taken next time. This is a correct prediction until the last branch of the loop, when the branch does not get taken. Unfortunately, if the loop is run again, the branch predictor remembers that the last branch was taken. Therefore, it incorrectly predicts that the branch should be taken when the loop is first run again. In summary, a 1-bit branch predictor mispredicts the first and last branches of a loop.

A 2-bit dynamic branch predictor solves this problem by having four states: strongly taken, weakly taken, weakly not taken, and strongly not taken, as shown. When the loop is repeating, it enters the strongly not taken state and predicts that the branch should not be taken next time. This is correct until the last branch of the loop, which is taken and moves the predictor to the weakly not taken state. When the loop is first run again, the branch predictor correctly predicts that the branch should not be taken and reenters the strongly not taken state. In summary, a 2-bit branch predictor mispredicts only the last branch of a loop.

As one can imagine, branch predictors may be used to track even more history of the program to increase the accuracy of predictions. Good branch predictors achieve better than 90% accuracy on typical programs.

The branch predictor operates in the Fetch stage of the pipeline so that it can determine which instruction to execute on the next cycle. When it predicts that the branch should be taken, the processor fetches the next instruction from the branch destination stored in the branch target buffer. By keeping track of both branch and jump destinations in the branch target buffer, the processor can also avoid flushing the pipeline during jump instructions.

Untitled

Superscalar Processor

A superscalar processor contains multiple copies of the datapath hardware to execute multiple instructions simultaneously. The figure shows a block diagram of a two-way superscalar processor that fetches and executes two instructions per cycle. The datapath fetches two instructions at a time from the instruction memory. It has a six-ported register file to read four source operands and write two results back in each cycle. it also contains two ALUs and a two-ported data memory to execute the two instructions at the same time.

Untitled

The figure shows a pipeline diagram illustrating the two-way superscalar processor executing two instructions on each cycle. For this program, the processor has a CPI of 0.5. Designers commonly refer to the reciprocal of the CPI as the instructions per cycle, or IPC. This processor has an IPC of 2 on this program.

Untitled

Executing many instructions simultaneously is difficult because of dependencies. For example, the figure shows a pipeline diagram running a program with data dependencies. The dependencies in the code are shown in blue. The add instruction is dependent on $t0, which is produced by the lw instruction, so it cannot be issued at the same time as lw. Indeed, the add instruction stalls for yet another cycle so that lw can forward $t0 to add in cycle 5. The other dependencies (between sub and and based on $t0, and between or and sw based on $t3) are handled by forwarding results produced in one cycle to be consumed in the next. This program, also given below, requires five cycles to issue six instructions, for an IPC of 1.17.

lw $t0, 40($s0)
add $t1, $t0, $s1
sub $t0, $s2, $s3
and $t2, $s4, $t0
or $t3, $s5, $s6
sw $s7, 80($t3)

Recall that parallelism comes in temporal and spatial forms. Pipelining is a case of temporal parallelism. Multiple execution units is a case of spatial parallelism. Superscalar processors exploit both forms of parallelism to squeeze out performance far exceeding that of our single-cycle and multicycle processors.

Commercial processors may be three-, four-, or even six-way superscalar. They must handle control hazards such as branches as well as data hazards. Unfortunately, real programs have many dependencies, so wide superscalar processors rarely fully utilize all of the execution units. Moreover, the large number of execution units and complex forwarding networks consume vast amounts of circuitry and power.

Untitled

Out-of-Order Processor

To cope with the problem of dependencies, an out-of-order processor looks ahead across many instructions to issue independent instructions as rapidly as possible. The instructions can be issued in a different order than that written by the programmer, as long as dependencies are honored so that the program produces the intended result.

Consider running the same program from above on a two-way superscalar out-of-order processor. The processor can issue up to two instructions per cycle from anywhere in the program, as long as dependencies are observed. The figure shows the data dependencies and the operation of the processor. The classifications of dependencies as RAW and WAR will be discussed shortly. The constraints on issuing instructions are described below.

Cycle 1
- The lw instruction issues.
- The add, sub, and and instructions are dependent on lw by way of $t0, so they cannot issue yet. However, the or instruction is independent, so it also issues.
Cycle 2
- Remember, that there is a two-cycle latency between when a lw instruction issues and when a dependent instruction can use its result, so add cannot issue yet because of the $t0 dependence. sub writes $t0, so it cannot issue before add, lest add receive the wrong value of $t0. and is dependent on sub.
- Only the sw instruction issues.
Cycle 3
- On cycle 3, $t0 is available, so add issues. sub issues simultaneously, because it will not write $t0 until after add consumes $t0.
Cycle 4
- The and instruction issues. $t0 is forwarded from sub to and.

The out-of-order processor issues the six instructions in four cycles, for an IPC of 1.5.

Untitled

The dependence of add on lw by way of $t0 is a read after write hazard (RAW). add must not read $t0 until after lw has written it. This is the type of dependency we are accustomed to handling in the pipelined processor. It inherently limits the speed at which the program can run, even if infinitely many execution units are available. Similarly, the dependence of sw on or by way of $t3 and of and on sub by way of $t0 are RAW dependencies.

The dependence between sub and add by way of $t0 is called a write after read (WAR) hazard, or an antidependence. sub must not write $t0 before add reads $t0, so that add receives the correct value according to the original order of the program. WAR hazards could not occur in the simple MIPS pipeline, but they may happen in an out-of-order processor if the dependent instruction (sub in this case) is moved too early.

A WAR hazard is not essential to the operation of the program. It is merely an artifact of the programmer’s choice to use the same register for two unrelated instructions. If the sub instruction had written $t4 instead of $t0, the dependency would disappear and sub could be issued before add. The MIPS architecture only has 32 registers, so sometimes the programmer is forced to reuse a register and introduce a hazard just because all the other registers are in use.

A third type of hazard, not shown in the program, is called write after write (WAW) or an output dependence. A WAW hazard occurs if an instruction attempts to write a register after a subsequent instruction has already written it. The hazard would result in the wrong value being written to the register. For example, in the following program, add and sub both write $t0. The final value in $t0 should come from sub according to the order of the program. If an out-of-order processor attempted to execute sub first, the WAW hazard would occur.

add $t0, $s1, $s2
sub $t0, $s3, $s4

WAW hazards are not essential either; again, they are artifacts caused by the programmer’s using the same register for two unrelated instructions. If the sub instruction were issued first, the program could eliminate the WAW hazard by discarding the result of the add instead of writing it to $t0 this is called squashing the add.

Out-of-order processors use a table to keep track of instructions waiting to issue. The table, sometimes called a scoreboard, contains information about the dependencies. The size of the table determines how many instructions can be considered for issue. On each cycle, the processor examines the table and issues as many instructions as it can, limited by the dependencies and by the number of execution units that are available.

The instruction level parallelism (ILP) is the number of instructions that can be executed simultaneously for a particular program and microarchitecture. Theoretical studies have shown that the ILP can be quite large for out-of-order microarchitectures with perfect branch predictors and enormous amounts of execution units. However, practical processors seldom achieve an ILP greater than 2 or 3, even with six way superscalar datapaths with out-of-order execution.

Register Renaming

Out-of-order processors use a technique called register renaming to eliminate WAR hazards. Register renaming adds some nonarchitectural renaming registers to the processor. For example, a MIPS processor might add 20 renaming registers, called $r0 - $r19. The programmer cannot use these registers directly, because they are not part of the architecture. However, the processor is free to use them to eliminate hazards.

For example, in the previous section, a WAR hazard occurred between the sub and add instructions based on reusing $t0. The out-of-order processor could rename $t0 to $r0 for the sub instruction. Then sub could be executed sooner, because $r0 has no dependency on the add instruction. The processor keeps a table of which registers were renamed so that it can consistently rename registers in subsequent dependent instructions. In this example, $t0 must also be renamed to $r0 in the and instruction, because it refers to the result of sub.

The figure shows the same program from above executing on an out-of-order processor with register renaming. $t0 is renamed to $r0 in sub and and to eliminate the WAR hazard. The constraints on issuing instructions are described below.

Cycle 1
- The lw instruction issues.
- The add instruction is dependent on lw by way of $t0, so it cannot issue yet. However, the sub instruction is independent now that its destination has been renamed to $r0, so sub also issues.
Cycle 2
- Remember that there is a two-cycle latency between when a lw issues and when a dependent instruction can use its result, so add cannot issue yet because of the $t0 dependence.
- The and instruction is dependent on sub, so it can issue. $r0 is forwarded from sub to and.
- The or instruction is independent, so it also issues.
Cycle 3
- On cycle 3, $t0 is available, so add issues. $t3 is also available, so sw issues.

The out-of-order processor with register renaming issues the six instructions in three cycles, for an IPC of 2.

Untitled

Single Instruction Multiple Data

The term SIMD stands for single instruction multiple data, in which a single instruction acts on multiple pieces of data in parallel. A common application of SIMD is to perform many short arithmetic operations at once, especially for graphics processing. This is also called packed arithmetic.

For example, a 32-bit microprocessor might pack four 8-bit data elements into one 32-bit word. Packed add and subtract instructions operate on all four data elements withing the word in parallel. The figure shows a packed 8-bit addition summing four pairs of 8-bit numbers to produce four results. The word could also be divided into two 16-bit elements. Performing packed arithmetic requires modifying the ALU to eliminate carries between the smaller data elements. For example, a carry out of $a_{0} + b_{0}$ should not affect the result of $a_{1} + b_{1}$ .

Short data elements often appear in graphics processing. For example, a pixel in a digital photo may use 8 bits to store each of the red, green, and blue color components. Using an entire 32-bit word to process one of these components wastes the upper 24 bits. When the components from four adjacent pixels are packed into a 32-bit word, the processing can be performed four times faster.

SIMD instructions are even more helpful for 64-bit architectures, which can pack eight 8-bit elements, four 16-bit elements or two 32-bit elements into a single 64-bit word. SIMD instructions are also used for floating-point computations; for example, four 32-bit single-precision floating-point values can be packed into a single 128-bit word.

Untitled

Multithreading

Because the ILP of real programs tends to be fairly low, adding more execution units to a superscalar or out-of-order processor gives diminishing returns. Another problem is that memory is much slower than the processor. Most loads and stores access a smaller and faster memory, called a cache. However, when the instructions or data are not available in the cache, the processor may stall for 100 or more cycles while retrieving the information from the main memory. Multithreading is a technique that helps keep a processor with many execution units busy even if the ILP of a program is low or the program is stalled waiting for memory.

To explain multithreading, we need to define a few new terms. A program running on a computer is called a process. Computers can run multiple processes simultaneously; for example, you can play music on a PC while surfing the web and running a virus checker. Each process consists of one or more threads that also run simultaneously. For example, a word processor may have one thread handling the user typing, a second thread spell-checking the document while the user works, and a third thread printing the document. In this way, the user does not have to wait, for example, for a document to finish printing before being able to type again.

In a conventional processor, the threads only give the illusion of running simultaneously. The threads actually take turns being executed on the processor under control of the OS. When one thread’s turn ends, the OS saves its architectural state, loads the architectural state of the next thread, and starts executing that next thread. This procedure is called context switching. As long as the processor switches through all the threads fast enough, the user perceives all of the threads as running at the same time.

A multithreaded processor contains more than one copy of its architectural state, so that more than one thread can be active at a time. For example, if we extended a MIPS processor to have four program counters and 128 registers, four threads could be available at one time. If one thread stalls while waiting for the data from main memory, the processor could context switch to another thread without any delay, because the program counter and registers are already available. Moreover, if one thread lacks sufficient parallelism to keep all the execution units busy, another thread could issue instructions to the idle units.

Multithreading does not improve the performance of an individual thread, because it does not increase the ILP. However, it does improve the overall throughput of the processor, because multiple threads can use processor resources that would have been idle when executing a single thread. Multithreading is also relatively inexpensive to implement, because it replicates only the PC and register file, not the execution units and memories.

Multiprocessors

A multiprocessor system consists of multiple processors and a method for communication between the processors. A common form of multiprocessing in computer systems is symmetric multiprocessing (SMP), in which two or more identical processors share a single main memory.

The multiple processors may be separate chips or multiple cores on the same chip. Modern processors have enormous numbers of transistors available. Using the to increase the pipeline depth or to add more execution units to a superscalar processor gives little performance benefit and is a waste of power. Around the year 2005, computer architects made a major shift to build multiple copies of the processor on the same chip; thee copies are called cores.

Multiprocessors can be used to run more threads simultaneously or to run a particular thread faster. Running more threads simultaneously is easy; the threads are simply divided up among the processors. Unfortunately, typical PC users need to run only a small number of threads at any given time. Running a particular thread faster is much more challenging. The programmer must divide the thread into pieces to perform on each processor. This becomes tricky when the processors need to communicate with each other. One of the major challenges for computer designers and programmers is to effectively use large numbers of processor cores.

Other forms of multiprocessing include asymmetric multiprocessing and clusters. Asymmetric multiprocessors use separate specialized microprocessors for separate tasks. For example, a cell phone contains a digital signal processor (DSP) with specialized instructions to decipher the wireless data in real time and a separate conventional processor to interact with the user, manage the phone book, and play games. In clustered multiprocessing, each processor has its own local memory system. Clustering can also refer to a group of PCs connected together on the network running software to jointly solve a large problem.